
This PR adds a new Wazuh integration for Wazuh decoder rule generation tool #79

Merged
MiguelCasaresRobles merged 82 commits into wazuh:main from Hasitha9796:main
Jun 8, 2026

Conversation

Hasitha9796 commented May 1, 2026


Summary

This PR adds a new integration named wazuh_decoder_rule_tool, a FastAPI-based tool for analyzing logs, checking existing Wazuh decoder/rule matches through wazuh-logtest, and generating custom decoder and rule XML.

New Features

AI-Powered Generation (Hybrid Approach)

  • Hybrid architecture: programmatically generates correct Wazuh decoder XML, then uses an LLM to review and improve the OS_Regex patterns
  • Multiple AI providers: Ollama (local, no rate limits), DashScope (Qwen 3.6 Plus), and OpenRouter
  • wazuh-logtest integration: every AI generation first checks wazuh-logtest to determine:
    • Whether a custom decoder is needed at all
    • The correct parent strategy (<program_name> when available, <prematch> otherwise)
    • Which fields are already decoded by built-in decoders (skipped automatically)
  • Priority fallback: Ollama > DashScope > OpenRouter
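
The priority fallback listed above can be pictured as a first-match lookup over configured providers. The following is a minimal sketch, not the PR's implementation: `OLLAMA_BASE_URL` appears in this PR, while `DASHSCOPE_API_KEY`, `OPENROUTER_API_KEY`, and the `pick_provider` helper are names assumed for illustration.

```python
import os
from typing import Mapping, Optional

# Priority order from the PR description: Ollama > DashScope > OpenRouter.
# Only OLLAMA_BASE_URL is named in the PR; the two key names are assumptions.
PROVIDER_ENV_VARS = (
    ("ollama", "OLLAMA_BASE_URL"),
    ("dashscope", "DASHSCOPE_API_KEY"),    # hypothetical variable name
    ("openrouter", "OPENROUTER_API_KEY"),  # hypothetical variable name
)


def pick_provider(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the first provider whose variable is set and non-empty, else None."""
    env = os.environ if env is None else env
    for name, var in PROVIDER_ENV_VARS:
        if env.get(var, "").strip():
            return name
    return None
```

Passing the environment as a mapping keeps the selection logic testable without touching the real process environment.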

Enhanced ML Decoder Similarity

  • Ensemble model combining TF-IDF (exact token matching) + SBERT (semantic similarity)
  • Configurable weighting (default: 40% TF-IDF, 60% SBERT)
  • Enhanced tokenization preserving regex patterns
  • Backward compatible with existing TF-IDF fallback
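
As a rough illustration of the weighting described above, the ensemble can be read as a weighted average of the two similarity scores, falling back to TF-IDF alone when no SBERT score is available. The function name and signature are assumptions; only the 40/60 default split comes from the description.

```python
from typing import Optional


def ensemble_score(
    tfidf_score: float,
    sbert_score: Optional[float],
    w_tfidf: float = 0.4,
    w_sbert: float = 0.6,
) -> float:
    """Blend two similarity scores; use TF-IDF alone if SBERT is unavailable."""
    if sbert_score is None:
        # Fallback path: no semantic model loaded, keep the lexical score as-is.
        return tfidf_score
    total = w_tfidf + w_sbert
    # Dividing by the weight sum keeps the result in range even if the
    # configured weights do not add up to exactly 1.
    return (w_tfidf * tfidf_score + w_sbert * sbert_score) / total
```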

Improved Decoder Generation

  • Split decoders: one child decoder per field for better accuracy
  • Robust prefix generalization (timestamps, IPs, MAC addresses, PIDs)
  • CEF (Common Event Format) log support with field mapping
  • Per-field validation explaining which fields will/won't be decoded
  • Multiple log type handlers: syslog, JSON, key=value, bracketed, Java dash, Android, Palo Alto CSV
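
The split-decoder idea (one child decoder per field) might look roughly like the sketch below. It is an assumption-laden illustration: `build_split_decoders` is a hypothetical helper, the `-event` child suffix mirrors the feedback records in this PR, and the parent uses `<program_name>` only as one of the two strategies the description mentions. No XML escaping is applied to the patterns.

```python
from typing import Iterable, Tuple


def build_split_decoders(app_name: str, field_regexes: Iterable[Tuple[str, str]]) -> str:
    """Render a parent decoder plus one same-named child decoder per field."""
    blocks = [
        f'<decoder name="{app_name}">\n'
        f"  <program_name>{app_name}</program_name>\n"
        f"</decoder>"
    ]
    child_name = f"{app_name}-event"  # all children share one decoder name
    for field, os_regex in field_regexes:
        blocks.append(
            f'<decoder name="{child_name}">\n'
            f"  <parent>{app_name}</parent>\n"
            f"  <regex>{os_regex}</regex>\n"
            f"  <order>{field}</order>\n"
            f"</decoder>"
        )
    return "\n\n".join(blocks)
```

Keeping each field in its own child means one pattern failing to match does not prevent the other fields from being decoded.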

Robustness & Reliability

  • Timeouts on all git subprocess calls (clone, pull, sparse-checkout) to prevent startup hangs
  • Proper Wazuh OS_Regex validation (no PCRE patterns, correct \. vs . semantics)
  • Non-blocking SSH with configurable timeouts
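
A bounded subprocess call of the kind described above can be sketched with the standard library's `subprocess.run(..., timeout=...)`. The helper name and return shape are assumptions; the demo uses a stand-in Python command so it runs without a repository.

```python
import subprocess
import sys
from typing import Sequence, Tuple


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> Tuple[bool, str]:
    """Run a command, returning (succeeded, output) instead of hanging forever."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left behind.
        return False, f"timed out after {timeout_s}s"
    except FileNotFoundError:
        return False, f"command not found: {cmd[0]}"
    return proc.returncode == 0, (proc.stdout or proc.stderr).strip()


if __name__ == "__main__":
    # Stand-in command; a real caller might pass ["git", "pull", "--ff-only"].
    print(run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=10))
```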

Included

  • FastAPI backend with streaming AI responses
  • Single-page HTML/JS UI with decoder analysis, rule generation, AI generation, and testing
  • Log analysis using heuristics with regex generation engine for Wazuh OS_Regex compatibility
  • wazuh-logtest validation (local or remote via SSH)
  • ML-based decoder similarity (TF-IDF + optional SBERT ensemble)
  • Rule ML model trained from wazuh-ruleset
  • Per-field feedback collection for continuous improvement
  • README with comprehensive setup instructions including AI provider configuration

Testing

The app can be tested locally:

  1. Set up the virtual environment:
     cd integrations/wazuh_decoder_rule_tool
     python3 -m venv venv
     source venv/bin/activate
     pip install -r requirements.txt
  2. Generate SSL certificates:
     mkdir -p certs
     openssl req -x509 -newkey rsa:4096 -keyout certs/localhost.key -out certs/localhost.crt -days 365 -nodes -subj "/CN=localhost"
  3. Start the application (with AI):
     export OLLAMA_BASE_URL=http://localhost:11434/v1
     export OLLAMA_MODEL=llama3.2:3b
     uvicorn app.main:app --host 0.0.0.0 --port 8443 --ssl-certfile certs/localhost.crt --ssl-keyfile certs/localhost.key

Access the application via https://localhost:8443.

Connecting to Wazuh VM for wazuh-logtest

export WAZUH_SSH_HOST=192.168.56.10
export WAZUH_SSH_PORT=22
export WAZUH_SSH_USER=vagrant
export WAZUH_SSH_PASSWORD=vagrant
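
One way to consume these variables is to read them with fallback defaults and assemble the `ssh` argument list. This is a sketch under assumptions: the helper name and the default values are illustrative, and `WAZUH_SSH_PASSWORD` is left out because the OpenSSH client takes no password argument (how the tool actually uses it is not shown here).

```python
import os
from typing import List, Mapping, Optional


def build_logtest_ssh_command(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Build an ssh argv that runs wazuh-logtest on the configured host."""
    env = os.environ if env is None else env
    host = env.get("WAZUH_SSH_HOST", "127.0.0.1")   # assumed default
    port = env.get("WAZUH_SSH_PORT", "22")          # assumed default
    user = env.get("WAZUH_SSH_USER", "vagrant")     # assumed default
    logtest = env.get("WAZUH_LOGTEST_PATH", "/var/ossec/bin/wazuh-logtest")
    return [
        "ssh",
        "-p", str(port),
        "-o", "ConnectTimeout=10",  # bound the connection attempt
        f"{user}@{host}",
        f"sudo {logtest}",
    ]
```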

Example Scenario

  1. Paste a log like: May 19 12:34:56 custom-server myapp[1234]: User 'admin' failed to authenticate from IP 192.168.1.100 due to invalid_password
  2. Click Analyze to detect log type and extract fields
  3. Select fields to extract (e.g., user, srcip)
  4. Click Generate for programmatic decoder+rule generation
  5. Click AI Generate for AI-assisted pattern improvement
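
To make the scenario concrete, the sketch below pulls `user` and `srcip` out of the sample log with Python's `re` module. It is only a stand-in for what a decoder would extract: Python regex syntax differs from Wazuh OS_Regex (here `\.` is a literal dot), and the pattern and helper name are illustrative.

```python
import re

SAMPLE = (
    "May 19 12:34:56 custom-server myapp[1234]: User 'admin' failed to "
    "authenticate from IP 192.168.1.100 due to invalid_password"
)

# Python `re` syntax, not OS_Regex: the dots in the IP part are escaped literals.
PATTERN = re.compile(
    r"User '(?P<user>[^']+)' .* from IP (?P<srcip>\d+\.\d+\.\d+\.\d+)"
)


def extract_fields(log_line: str) -> dict:
    """Return the named fields found in the log line, or an empty dict."""
    match = PATTERN.search(log_line)
    return match.groupdict() if match else {}
```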

Hasitha9796 and others added 30 commits August 20, 2025 17:33
…tterns and full preceding words instead of truncating prefixes
…d generalize them to \d+ to prevent brittle anchors

Copilot AI left a comment


Pull request overview

This PR introduces a new wazuh_decoder_rule_tool integration: a FastAPI-based UI/API for analyzing pasted logs, optionally validating them via wazuh-logtest, and generating Wazuh decoder/rule XML. It also adds an “enhanced” ML decoder-similarity approach (TF‑IDF + SBERT) plus scripts/datasets to train a custom similarity model from Wazuh ruleset test data.

Changes:

  • Add the FastAPI app’s HTML/JS/CSS frontend and supporting backend utilities for decoder/rule generation workflows.
  • Add ML enhancements: ensemble similarity model wrapper, dataset builder + training script, and accompanying tests/docs.
  • Add local datasets and TLS artifacts for local HTTPS testing (currently including private keys).

Reviewed changes

Copilot reviewed 21 out of 26 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
integrations/wazuh_decoder_rule_tool/tests/test_ml_enhanced.py Adds unit tests for enhanced ML similarity components.
integrations/wazuh_decoder_rule_tool/tests/test_integration.py Adds a basic integration test for enhanced ML model loading.
integrations/wazuh_decoder_rule_tool/scripts/train_similarity.py Adds SBERT contrastive training script for decoder similarity.
integrations/wazuh_decoder_rule_tool/scripts/build_dataset.py Adds script to build training/validation datasets from Wazuh rules-testing suites + feedback.
integrations/wazuh_decoder_rule_tool/requirements.txt Adds Python dependencies for running the tool (FastAPI/Uvicorn/ML libs).
integrations/wazuh_decoder_rule_tool/README.md Documents local HTTPS run instructions, remote VM mode, and ML training workflow.
integrations/wazuh_decoder_rule_tool/ML_ENHANCEMENT_SUMMARY.md Documents ML feature-engineering + ensemble approach and future tuning ideas.
integrations/wazuh_decoder_rule_tool/key.pem Adds a private key file (should not be committed).
integrations/wazuh_decoder_rule_tool/generated/decoders/local_myapp_decoder_20260307094900.xml Adds generated decoder XML output artifact.
integrations/wazuh_decoder_rule_tool/generated/decoders/local_myapp_decoder_20260307094544.xml Adds generated decoder XML output artifact (duplicate-style).
integrations/wazuh_decoder_rule_tool/data/datasets/val.jsonl Adds validation dataset records for ML training.
integrations/wazuh_decoder_rule_tool/data/datasets/feedback.jsonl Adds feedback dataset examples used for training/tuning.
integrations/wazuh_decoder_rule_tool/data/datasets/feedback_rejections.jsonl Adds rejected feedback examples for analysis/training workflows.
integrations/wazuh_decoder_rule_tool/certs/localhost.key Adds a private TLS key for local HTTPS (should not be committed).
integrations/wazuh_decoder_rule_tool/certs/localhost.crt Adds a self-signed TLS certificate for local HTTPS.
integrations/wazuh_decoder_rule_tool/cert.pem Adds a certificate artifact for local HTTPS usage.
integrations/wazuh_decoder_rule_tool/app/wazuh_logtest.py Adds a helper to run wazuh-logtest via SSH (currently hardcoded/inconsistent).
integrations/wazuh_decoder_rule_tool/app/templates/index.html Adds the single-page HTML UI for the tool.
integrations/wazuh_decoder_rule_tool/app/static/styles.css Adds styling for the UI.
integrations/wazuh_decoder_rule_tool/app/static/app.js Adds UI logic for navigation, generate/test flows, ML status, AI generation, feedback, history.
integrations/wazuh_decoder_rule_tool/app/decoder_ml.py Adds baseline TF‑IDF similarity models + parsing utilities for decoders/rules.
integrations/wazuh_decoder_rule_tool/app/decoder_ml_enhanced.py Adds enhanced feature engineering + ensemble similarity model + compatibility wrapper.
integrations/wazuh_decoder_rule_tool/.gitignore Adds ignores for venv/cache/model/repo directories.


function toggleConditionsRow() {
  const req = document.getElementById('ruleRequirement').value.trim();
  document.getElementById('ruleFieldConditionsRow').style.display = req ? 'flex' : 'none';
  document.getElementById('ruleMatchConditionsRow').style.display = req ? 'flex' : 'none';
Comment on lines +18 to +35
    try:
        # This might fail if no Wazuh repo is available, but that's OK for this test
        model = ensure_ml_model_enhanced(force_refresh=False, use_ensemble=True)
        # If we get here without exception, the function works
        assert model is not None or model is None # Either is fine
        print("✓ ensure_ml_model_enhanced executed successfully")
        return True
    except Exception as e:
        print(f"✗ ensure_ml_model_enhanced failed: {e}")
        return False


if __name__ == "__main__":
    success = test_ensure_ml_model_enhanced()
    if success:
        print("Integration test passed!")
    else:
        print("Integration test failed!")
    parts.extend([self.prematch] * int(prematch_weight))
if self.regex:
    # Extract meaningful tokens from regex
    regex_tokens = re.findall(r'\[\\w\+\\]|\\\\d\+|\\\\S\+|\\\\w\+', self.regex)
Comment on lines +38 to +51

parts = []
if self.name:
    parts.extend([self.name] * int(name_weight))
if self.program_name:
    parts.extend([self.program_name] * int(program_weight))
if self.prematch:
    parts.extend([self.prematch] * int(prematch_weight))
if self.regex:
    # Extract meaningful tokens from regex
    regex_tokens = re.findall(r'\[\\w\+\\]|\\\\d\+|\\\\S\+|\\\\w\+', self.regex)
    parts.extend(regex_tokens * int(regex_weight))
if self.order:
    parts.extend(self.order * int(order_weight))
Comment on lines +1 to +5
-----BEGIN PRIVATE KEY-----
MIIJQgIBADANBgkqhkiG9w0BAQEFAASCCSwwggkoAgEAAoICAQDeCJuheTkfwUSK
shHW/6XR28sohDtaA+BgE5VQhA/dO0A0OD4Y+FHFvwqDZg4j74mZ1s4BBxdercSO
l1NXmfTJvH0WhY09vSyS3g4N/T1unrtTFUTrC3Dc5ovLAxAUe2AHLGhQcXGWRbTq
pEL1KEoYG89DSisTjSBOcoM3dE8fnU2Gc7YCvLUh8IpIaYLr0GOiQumAGhxIyWGq
Comment on lines +14 to +18
# Cache directories and ML models
data/models/
data/wazuh_repo/
data/wazuh_ruleset_repo/

@@ -0,0 +1,3 @@
{"log":"03-17 16:13:38.811 1702 2395 D WindowManager: printFreezingDisplayLogsopening app wtoken = AppWindowToken{9f4ef63 token=Token{a64f992 ActivityRecord{de9231d u0 com.tencent.qt.qtl/.activity.info.NewsDetailXmlActivity t761}}}, allDrawn= false, startingDisplayed = false, startingMoved = false, isRelaunching = false","decoder":{"name":"myapp-event","parent":"myapp","prematch":"WindowManager:","regex":"(\\d+-\\d+ \\d+:\\d+:\\d+.\\d+) \\d+ \\d+ \\S WindowManager: \\S+ \\S+ wtoken = (\\.+) token=(\\.+), allDrawn= (\\S+)","order":["logtime","wtoken","token","allDrawn"],"source_file":"feedback/windowmanager.json"}}
{"log":"20171223-22:15:33:144|Step_SPUtils|30002312| getTodayTotalDetailSteps = 1514038440000##7013##548365##8661##12836##27176966","decoder":{"name":"myapp-event","parent":"myapp","prematch":"Step_SPUtils","regex":"(\\.+)\\|Step_SPUtils\\|30002312\\| getTodayTotalDetailSteps = (\\.+)","order":["logtime","getTodayTotalDetailSteps"],"source_file":"feedback/pipemetric.json"}}
{"timestamp": "2026-05-16T08:56:11.647689Z", "approved": true, "log": "May 16 14:22:31 plc-gateway01 scada-engine[2241]: ALERT Modbus unauthorized write request detected from 10.10.50.24 function_code=0x10 register=40123", "extract_fields": ["srcip", "funtion_code"], "notes": "", "decoder": {"name": "myapp-event", "parent": "myapp", "prematch": "scada-engine", "regex": "ALERT\\s+Modbus\\s+unauthorized\\s+write\\s+request\\s+detected\\s+from\\s+(\\d+.\\d+.\\d+.\\d+)\\s+function_code=(\\d+x\\d+)\\s+register=\\d+", "order": ["srcip", "function_code"], "source_file": "feedback/myapp.json"}, "target_text": "myapp-event myapp scada-engine alert\\s+modbus\\s+unauthorized\\s+write\\s+request\\s+detected\\s+from\\s+(\\d+.\\d+.\\d+.\\d+)\\s+function_code=(\\d+x\\d+)\\s+register=\\d+ srcip function_code feedback/myapp.json"}
{"timestamp": "2026-04-29T05:52:13.354712Z", "approved": false, "app_name": "myapp", "log": "[2026-04-29T04:29:06,056][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [node-1] Starting housekeeping task for auto refresh streaming jobs.", "extract_fields": ["logtime", "loglevel", "message"], "notes": "[(\\d+-\\d+-\\S+:\\d+:\\d+,\\d+)][(\\S+)\\s][\\.+] [\\S+] (\\.+)"}
{"timestamp": "2026-04-29T08:50:41.323760Z", "approved": false, "app_name": "myapp", "log": "[2026-04-29T04:29:06,056][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [node-1] Starting housekeeping task for auto refresh streaming jobs.", "extract_fields": [], "notes": "It should be corrected like this"}
{"timestamp": "2026-04-29T08:50:41.368350Z", "approved": false, "app_name": "myapp", "log": "[2026-04-29T04:29:06,056][INFO ][o.o.s.s.c.FlintStreamingJobHouseKeeperTask] [node-1] Starting housekeeping task for auto refresh streaming jobs.", "extract_fields": [], "notes": "It should be corrected like this"}
{"timestamp": "2026-05-16T08:56:23.312599Z", "approved": false, "app_name": "myapp", "log": "May 16 14:22:31 plc-gateway01 scada-engine[2241]: ALERT Modbus unauthorized write request detected from 10.10.50.24 function_code=0x10 register=40123", "extract_fields": ["srcip", "funtion_code"], "notes": ""}
Comment on lines +71 to +78
For this workspace, the app now defaults to:

```bash
WAZUH_SSH_HOST=192.168.56.10
WAZUH_SSH_PORT=22
WAZUH_SSH_USER=vagrant
WAZUH_SSH_PASSWORD=vagrant
```
Comment on lines +4 to +18
WAZUH_HOST = "127.0.0.1"
WAZUH_PORT = "2222"
WAZUH_USER = "vagrant"

# read from environment variable
WAZUH_LOGTEST = os.getenv("WAZUH_LOGTEST_PATH", "/var/ossec/bin/wazuh-logtest")


def run_logtest(log_line):
    cmd = [
        "ssh",
        "-p", WAZUH_PORT,
        f"{WAZUH_USER}@{WAZUH_HOST}",
        f"sudo {WAZUH_LOGTEST}"
    ]
- Hybrid AI generation: programmatic base XML (guaranteed correct) + AI review for regex improvement
- wazuh-logtest always checked before AI generation to determine parent strategy
- Parent decoder uses <program_name> when log has a decoded program name
- Fields already decoded by built-in decoders are skipped automatically
- AI prompt refocused on reviewing/improving regex patterns instead of writing XML from scratch
- Git subprocess calls now have timeouts to prevent startup hangs
- Updated README with AI provider setup and hybrid approach documentation
…ation

- Removed Decoder Generator and Rule Generator sections from HTML
- Moved input fields (appName, logsInput, extractFields, etc.) into AI view
- Removed 'Generate Decoder' and 'Generate Rule' sidebar nav items
- Made 'AI Generate' the default active view
- Cleaned up app.js: removed unused functions (showAnalysis, showXml,
  syncFeedback, readRulePayload, rule conditions UI, old button handlers)
- Updated history loading and test function to work without decoder view
- Added POST /api/install endpoint to write decoder/rule XML to Wazuh's
  custom decoders/rules directories (SSH or local)
- Added POST /api/uninstall endpoint to remove installed files
- Added POST /api/logtest/raw endpoint for running wazuh-logtest with
  arbitrary log samples and returning raw output + parsed fields
- Redesigned Test view with three cards: Installed Decoder (install/
  uninstall), Test Logs (editable sample input), and wazuh-logtest
  Output (raw stdout + parsed fields table)
- Added state management storing installed file paths in localStorage
- AI-generated XML is now persisted in JS so it can be installed from
  the Test view without re-running AI generation
…ailure

- Add generation_mode (auto/decoder_only/rule_only/both) to AI request
- Add validate_with_logtest flag and /api/ai/generate-validated endpoint
- Add _collect_ai_response, _extract_xml_from_ai_response helpers
- Add _validate_ai_decoder_with_logtest for auto-install+test validation
- Refactor _build_ai_prompt: shorter config block, concise ML/logtest context
- Add system prompt for Ollama (system+user roles), fix URL path
- Lower default temperature to 0.05 for more deterministic output
- Default model changed to wazuh-decoder
- UI: generation mode dropdown, validate checkbox, Generate & Validate button
- UI: show validation badge & details in AI output section
- UI: hide rule section when generation_mode=decoder_only
…ndpoint and automate rule group/static field sanitization
…coring, and sigmoid calibration

- Add log-type detection (_detect_log_type) with type-based boosting to bias results toward relevant decoder families (JSON, Windows, syslog, etc.)
- Add regex token overlap scoring (_regex_overlap_score) to boost patterns whose OS_Regex tokens match query log literals
- Add sigmoid confidence calibration for well-calibrated probabilities in [0,1]
- Tune ensemble weights: TF-IDF 0.3, SBERT 0.7 (semantic model is stronger for unseen formats)
- Raise minimum confidence gate to 0.15 to avoid low-confidence noise
- Add fine-tuned SBERT checkpoint loading with graceful fallback
- Enhance tokenizer to preserve more OS_Regex character classes
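
The sigmoid calibration and confidence gate from this commit could be sketched as follows. Only the 0.15 gate comes from the commit message; the `midpoint` and `steepness` parameters, their values, and both function names are assumptions made for illustration.

```python
import math
from typing import Optional


def calibrate(raw_score: float, midpoint: float = 0.5, steepness: float = 10.0) -> float:
    """Squash a raw similarity score into (0, 1) with a logistic curve."""
    # midpoint and steepness are hypothetical tuning knobs, not values from the PR.
    return 1.0 / (1.0 + math.exp(-steepness * (raw_score - midpoint)))


def gated_confidence(raw_score: float, min_confidence: float = 0.15) -> Optional[float]:
    """Return the calibrated confidence, or None if it falls below the gate."""
    confidence = calibrate(raw_score)
    return confidence if confidence >= min_confidence else None
```

Returning `None` below the gate lets callers distinguish "no usable match" from a genuine low score.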
… Modelfile

- Lower temperature (0.05→0.02) and top_p (0.85→0.80) for more deterministic output
- Increase repeat_penalty (1.15→1.20) and lower top_k (20→15) to reduce repetition
- Add self-validation checklist to catch common errors before output
- Add JSON log decoder and DHCP/MAC address examples
- Fix sshd example to use same decoder name for multiple children
- Add instruction: 'No text before or after' the XML block
…lization

- Default OLLAMA_BASE_URL to http://localhost:11434/v1 so it works without env vars
- Normalize /v1 suffix to prevent double-/v1 404 errors in URL construction
- Add 60s timeout to streaming client with retry on ReadTimeout (up to 3 attempts)
- Add decoder rule: multiple child decoders must use exact same decoder name
- Fix IP regex guidance: do not escape dots in \d+.\d+.\d+.\d+
- Update top_k to 15 and repeat_penalty to 1.20 to match Modelfile tuning
- Improve error messages for network/server issues
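
The `/v1` normalization in this commit can be illustrated with a small helper that strips any trailing `/v1` before appending it exactly once. The helper names are assumptions; the default URL is the one stated in the commit, and `/chat/completions` is the path of Ollama's OpenAI-compatible chat endpoint.

```python
from typing import Optional

DEFAULT_OLLAMA_BASE_URL = "http://localhost:11434/v1"


def normalize_base_url(url: Optional[str] = None) -> str:
    """Return a base URL that ends in exactly one '/v1' and no trailing slash."""
    base = (url or DEFAULT_OLLAMA_BASE_URL).strip().rstrip("/")
    if base.endswith("/v1"):
        base = base[: -len("/v1")]
    return base + "/v1"


def chat_completions_url(url: Optional[str] = None) -> str:
    """Build the chat endpoint URL without risking a doubled '/v1/v1'."""
    return normalize_base_url(url) + "/chat/completions"
```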
…to dataset builder

- Add load_rejection_records(): convert rejection notes with regex corrections into positive training pairs
- Add augment_with_dropout(): create robustness variants by randomly masking log tokens (15% prob)
- Rejection corrections teach SBERT to distinguish correct from broken regex patterns
- Dropout augmentation teaches model that partial log lines still map to same decoder
- Add structured logging of record counts throughout pipeline
…nting to SBERT training

- 5 epochs with best-checkpoint saving (by validation AUC)
- Larger batch size (64 configurable) for better in-batch negatives with MultipleNegativesRankingLoss
- Hard-negative augmentation: pair logs with categorically distinct decoders (30% ratio)
- Token dropout data augmentation for robustness on partial input
- Early stopping with patience=2 epochs
- Add binary evaluator with both positive and negative pairs for AUC measurement
- Configurable training device (default CPU to avoid MPS OOM with Ollama)
- Copy best checkpoint to 'final' directory for easy model loading
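
The early-stopping rule (keep the best checkpoint by validation AUC, stop after two epochs without improvement) can be shown independently of any training framework. This sketch takes a precomputed sequence of per-epoch AUC values; the function name and return shape are assumptions.

```python
from typing import Iterable, Tuple


def train_with_early_stopping(
    val_aucs: Iterable[float], patience: int = 2
) -> Tuple[int, float, int]:
    """Return (best_epoch, best_auc, epochs_run) for a stream of validation AUCs."""
    best_epoch, best_auc, stale, epochs_run = 0, float("-inf"), 0, 0
    for epoch, auc in enumerate(val_aucs, start=1):
        epochs_run = epoch
        if auc > best_auc:
            # A real loop would save the model checkpoint here.
            best_epoch, best_auc, stale = epoch, auc, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_auc, epochs_run
```

For example, with AUCs `[0.80, 0.85, 0.84, 0.83, 0.90]` the loop stops after epoch 4 and keeps the epoch 2 checkpoint, so the later 0.90 is never reached; that is the usual trade-off of a small patience value.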
The sidebar defaulted to AI Generate as active, but the corresponding
#view-ai div was missing the 'active' class, so CSS display:none kept
the entire AI generation page blank on initial load.
…egex instruction

The AI model consistently escapes dots (\.) in regex patterns because it is
trained on PCRE, where \. matches a literal dot. In Wazuh OS_Regex a bare '.'
is already a literal character and \. matches any character, so the escaped
form changes the meaning of the pattern.

Fix:
- Add _sanitize_decoder_xml_osregex() that strips \. → . in generated XML
- Apply it in _extract_xml_from_ai_response and the final return
- Strengthen the Modelfile and prompt instruction with WRONG/RIGHT examples
  to make the rule impossible to miss
- Fix sanitization regex: r'\.' was matching any char after backslash
  (breaking \d, \w, etc.). Use r'\\.' to match only backslash + literal dot.
- Add Example 7 to Modelfile showing correct TrafficLog IP extraction
  with unescaped dots in OS_Regex
- Strengthen prompt WRONG/RIGHT examples for IP regex
The streaming /api/ai/generate endpoint returns raw AI text without
server-side processing, so escaped dots (\.) pass through to the browser.
Add sanitizeOsRegex() in app.js that strips \. → . client-side after
XML extraction, covering both the streaming and validated endpoints.
Add a hand-curated example with unescaped dots for IP regex
(\d+.\d+.\d+.\d+) so the fine-tuned model natively learns
correct OS_Regex IP syntax instead of relying on post-processing.
…gex instructions

- Add _stream_ai_sanitized to post-process AI output and fix \d+\.\d+ → \d+.\d+
  (common AI mistake: escaping dots before \d for IPs in Wazuh OS_Regex)
- Enhance _sanitize_decoder_xml_osregex to target IP patterns specifically,
  only removing \. between \d quantifiers, not valid \.+ any-char quantifiers
- Update Modelfile with clearer IP regex instructions and new example conversations
- Update Modelfile.finetune with more training examples (iptables, squid, UFW, TrafficLog,
  CEF Palo Alto, nginx, SSH, netfilter, KV log)
- Fix kv-log-fields decoder order to match extract_fields
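
A targeted sanitizer of the kind this commit describes could be written with lookarounds so that only an escaped dot sitting between digit quantifiers is rewritten. This is a sketch, not the PR's `_sanitize_decoder_xml_osregex`; the helper name is hypothetical and it works on a single pattern string rather than on XML. Note that in Python's `re`, backslash followed by a literal dot is spelled `\\\.` in a raw string.

```python
import re

# Backslash + dot, only when preceded by `\d+` and followed by `\d`.
_ESCAPED_DOT_BETWEEN_DIGITS = re.compile(r"(?<=\\d\+)\\\.(?=\\d)")


def unescape_ip_dots(os_regex: str) -> str:
    """Turn `\\d+\\.\\d+` into `\\d+.\\d+` while leaving `\\.+` (any-char) untouched."""
    return _ESCAPED_DOT_BETWEEN_DIGITS.sub(".", os_regex)
```

Because the lookbehind and lookahead consume nothing, consecutive octets in an IP pattern are all fixed in one pass.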
…ex sanitizer

- Remove programmatic XML from ai_generate and ai_generate_validated
  — AI now generates from scratch using only analysis context
- Add _fix_osregex_bare_dot_quantifier: converts common AI mistakes
  (.+) → (\S+), .+ → \.+, .* → \.+ inside regex/prematch tags
- Update _OLLAMA_SYSTEM_PROMPT with explicit anti-pattern examples
  showing CORRECT vs WRONG OS_Regex patterns
- Strengthen _build_ai_prompt decoder_rules with OS_Regex constraints
  and anti-echo instructions
- Update Modelfile and Modelfile.finetune with anti-pattern section
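
The three rewrites named in this commit ((.+) to (\S+), bare .+ and .* to \.+) can be sketched as one string replacement followed by a regex substitution with a negative lookbehind. The helper below is illustrative only: it operates on one pattern string, whereas the commit applies the fix inside regex/prematch tags, and its name is not the PR's.

```python
import re

# A dot followed by + or *, but not one that is already escaped as `\.`.
_BARE_DOT_QUANTIFIER = re.compile(r"(?<!\\)\.[+*]")


def fix_bare_dot_quantifier(os_regex: str) -> str:
    """Rewrite PCRE-style `.+` / `.*` into OS_Regex forms."""
    # Capturing group first, so `(.+)` becomes a non-space capture.
    fixed = os_regex.replace("(.+)", r"(\S+)")
    # Remaining bare `.+` / `.*` become the OS_Regex any-character form `\.+`.
    # A function replacement avoids escape processing of the backslash.
    return _BARE_DOT_QUANTIFIER.sub(lambda _match: r"\.+", fixed)
```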
…d bare-dot sanitizer

- Remove raw streaming output (#aiOut) from UI — only show final extracted XML
- Add Reference Field-to-Pattern Mapping in _build_ai_prompt: programmatic
  regex patterns as text guidance (not XML blocks AI can echo)
- Add _infer_osregex_type helper to suggest correct OS_Regex pattern per field
- Add _build_fallback_decoder: silently builds programmatic decoder when AI
  produces no valid XML (uses user inputs like field_hints)
- Add _fix_osregex_bare_dot_quantifier: sanitizes (.+) → (\S+), .+ → \.+,
  .* → \.+ inside regex/prematch tags
- Update ai_generate_validated to fall back when all retries fail
- Remove ai-stream-block CSS (unused)
…d-aid sanitization

AI now handles structure (decoder names, hierarchy, order tags) but regex
patterns come from the proven programmatic engine. _inject_programmatic_regex
matches each <regex> to the next <order> and replaces the content with the
correct regex from analysis regex_order_pairs.
- Remove unreliable bare-dot/IP sanitizers for regex content
- _extract_xml_from_ai_response accepts regex_order_pairs param
- ai_generate and ai_generate_validated pass analysis data through
- Remove _inject_programmatic_regex, _build_fallback_decoder
- Remove _FIELD_PATTERN_MAP, _infer_osregex_type
- Remove Reference Field-to-Pattern Mapping from _build_ai_prompt
- _sanitize_decoder_xml_osregex reverts to band-aid fixes only
- _extract_xml_from_ai_response no longer takes regex_order_pairs
- ai_generate and ai_generate_validated have zero programmatic fallback
AI generates everything (structure + regex) independently.
…ript

AI now generates everything (structure + regex) independently.
Only band-aid sanitization remains: (.+) → (\S+), .+ → \.+,
\d+\.\d+ → \d+.\d+.

Add scripts/train_osregex.py — extracts 26 training pairs from
Modelfile.finetune into JSONL format and provides training commands
for Ollama 0.5+, Unsloth, llama.cpp, and Axolotl.
…regex correction

- scripts/generate_finetuning_data.py: downloads all 104 decoder + 133
  rule XMLs from wazuh-ruleset, generates 806 training pairs
  (725 train / 81 val) in JSONL format
- app/main.py: _inject_correct_regex silently replaces AI <regex>
  content with analysis-derived patterns; _INTERNAL_FIELD_REGEX
  maps field names to correct OS_Regex
- scripts/train_osregex.py: points to new 806-example dataset
- .gitignore: add .cache_decoders/
…refix regex generation

- sanitizeOsRegex disabled: backend _inject_correct_regex already fixes patterns
- build_split_regexes_from_fields: better first-field prefix handling (no \.+ prefix for start-of-log fields); use (\.+) for multi-word/multi-token field values
…on; expand CEF field aliases

- app.js: checkExistingDecoder() calls /api/analyze first and shows
  confirm() dialog if a builtin decoder already matches
- main.py: add 'source', 'destination', 'port' aliases for CEF field mapping
…gex token patterns

- AI prompt: provide correct prematch when no program_name is pre-decoded
- AI prompt: add both <program_name> and <prematch> strategy examples
- decoder_ml_enhanced.py: fix over-escaped regex tokens in enhanced tokenizer
- wazuh_logtest.py: use env vars with fallback defaults instead of hardcoded values
- .gitignore: exclude certs/, *.pem, *.key, *.crt
- README.md: make SSH config docs generic
@MiguelCasaresRobles MiguelCasaresRobles merged commit 6e6799b into wazuh:main Jun 8, 2026

3 participants